Overview


  • 1,240 samples collected, captured, and sequenced from 13 locations across Tanzania
    • I have sequencing data for 1,237 of these. Notes on the plate maps suggest some samples were missing (?).
  • Genotyped using:
    • ~1,800 MIP panel
      • Targets 1,800 SNPs across the genome (a combination of neutral SNPs and SNPs under selection)
      • 3 total sequencing runs
  • 734 samples brought forward for population structure analysis


Admixture Analysis


  • Admixture analysis: tested K = 1 through 10
    • 10 replicates for each K
    • The replicate with the highest log-likelihood was chosen for each K
    • That replicate’s cross-validation (CV) error was plotted to determine the number of populations (K) present in our data set
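The selection step above can be sketched in base R. This is a minimal illustration with a placeholder data frame (`runs` and its columns are hypothetical stand-ins, not the real ADMIXTURE output):

``` r
# Hypothetical replicate summary: 10 reps for each K = 1..10, each with a
# log-likelihood and a CV error (values here are placeholders).
set.seed(1)
runs <- expand.grid(K = 1:10, rep = 1:10)
runs$loglik   <- -1e5 - 500 * runs$K + rnorm(nrow(runs), sd = 50)
runs$cv_error <- 0.50 + 0.05 * abs(runs$K - 2) + runs$rep / 1000

# For each K, keep the replicate with the highest log-likelihood...
best <- do.call(rbind, lapply(split(runs, runs$K),
                              function(d) d[which.max(d$loglik), ]))

# ...and plot that replicate's CV error against K to choose K.
plot(best$K, best$cv_error, type = "b", xlab = "K", ylab = "CV error")
```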


  • The CV error gives a pretty obvious answer of K=2.

  • For K=2, I plotted the proportion of each sample’s genome assigned to either K1 or K2:



Given the high amount of missingness in some of the regions, I wanted to make sure that missingness doesn’t predict a sample’s assignment to one of the two populations. The plot below shows the frequency of missing genotypes (y-axis) across samples (x-axis); each sample is colored by the proportion of its genome assigned to K2:

  • I don’t see any relationship between missingness, location, and K assignment.
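As a sketch of that check (the genotype matrix and ancestry proportions are simulated here, since the real data isn’t embedded in this document):

``` r
# Simulated stand-ins: `geno` is a samples-x-SNPs genotype matrix with NA for
# missing calls; `q2` is each sample's proportion of ancestry assigned to K2.
set.seed(2)
geno <- matrix(sample(c(0, 1, NA), 734 * 1800, replace = TRUE,
                      prob = c(0.45, 0.45, 0.10)), nrow = 734)
q2 <- runif(734)

# Per-sample missingness, plotted and shaded by K2 proportion.
miss <- rowMeans(is.na(geno))
plot(seq_along(miss), miss, pch = 16, col = grey(1 - q2),
     xlab = "sample", ylab = "fraction of missing genotypes")

# A correlation near zero supports "missingness does not predict K assignment".
cor(miss, q2)
```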


PCA-based Analyses

  • We also ran PCA and PCoA analyses, shown below.


  • Both show a gradient of separation between the southern districts and the districts in the north and northwest, with samples from Kibaha spanning this gradient
    • To see this better, click on Kibaha in the legends above to remove it from the figure
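For reference, both analyses can be run from base R alone. A minimal sketch on a simulated genotype matrix (the real analysis may have used other packages):

``` r
# Simulated samples-x-SNPs dosage matrix (0/1/2 alternate-allele counts).
set.seed(3)
geno <- matrix(rbinom(200 * 500, size = 2, prob = 0.3), nrow = 200)
geno <- geno[, apply(geno, 2, var) > 0]   # drop any invariant sites

# PCA directly on the (scaled) genotype matrix.
pca <- prcomp(geno, scale. = TRUE)

# PCoA = classical multidimensional scaling on a pairwise distance matrix.
pcoa <- cmdscale(dist(geno), k = 2)

plot(pca$x[, 1], pca$x[, 2], xlab = "PC1", ylab = "PC2")
plot(pcoa[, 1], pcoa[, 2], xlab = "Axis 1", ylab = "Axis 2")
```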


  • To summarise:
    • We do identify two main populations (K1 and K2) in our data set, with the northern regions being more admixed than the southern regions
    • However, population structure is probably best represented as a continuous gradient (as per the unsupervised PCoA/PCA approaches).


  • I also tried k-means clustering, which gives very similar results to the admixture plot above.
  • I also tried discriminant analysis of principal components (DAPC) to see if we could get any more discrimination by district.
    • After some testing, I included ~140 components in the DAPC, and we do see some additional separation by district.
    • Additional spread among the northern and north-western districts.
    • NOTE for html: not rendered due to run time; code found in the R Markdown document
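Since the DAPC code isn’t rendered, here is the gist of both follow-up analyses on simulated data: k-means on the leading PC scores, and DAPC, which in essence is LDA on a reduced set of principal components. This is a base-R/MASS sketch; the real analysis presumably used adegenet’s `dapc()`:

``` r
library(MASS)  # lda(); MASS ships with standard R installations

# Simulated genotypes and district labels (placeholders for the real data).
set.seed(5)
geno     <- matrix(rbinom(150 * 300, size = 2, prob = 0.3), nrow = 150)
district <- factor(rep(c("north", "northwest", "south"), each = 50))
pcs_all  <- prcomp(geno)$x

# k-means on the leading PCs with K = 2, matching the admixture result;
# nstart > 1 re-runs from several random starts to avoid poor local optima.
km <- kmeans(pcs_all[, 1:5], centers = 2, nstart = 25)

# DAPC in essence: LDA on retained PCs (the real analysis retained ~140
# components; 40 here for the toy data).
fit  <- lda(pcs_all[, 1:40], grouping = district)
pred <- predict(fit)
table(district, pred$class)   # discrimination by district
```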


Variants Contributing to PCA-based methods

  • Moving forward we’ll probably use PCA so we can make use of the loadings
    • Below is the contribution of each loading value to components one and two, expressed as a percent:


## [1] 0.01368808
## [1] 0.009160282
## [1] 0.007722963


  • Overall, individual contributions to each component are small
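For reference, a sketch of how per-SNP percent contributions can be computed from `prcomp()` loadings (simulated matrix; in the real data the column names would be SNP IDs):

``` r
# Simulated genotype matrix; pca$rotation[, k] holds the loadings for PC k.
set.seed(6)
geno <- matrix(rbinom(100 * 250, size = 2, prob = 0.3), nrow = 100)
pca  <- prcomp(geno)

# Squared loadings, normalised to sum to 100%, give each SNP's percent
# contribution to a component.
contrib_pc1 <- 100 * pca$rotation[, 1]^2 / sum(pca$rotation[, 1]^2)
head(sort(contrib_pc1, decreasing = TRUE))
```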

  • Regions of interest on:
    • PC1: Two hits on chrm2 () and one on chrm11
    • PC2: Region upstream of dhps on chrm 8 and aat1 on chrm 6
    • PC3: dhps, crt


  • The hits on chromosomes 2 and 11 for PC1 match really well with the K=2 populations we identified in the admixture analysis


Random Forest


  • Using PCA-based approaches, we identified certain SNPs that appeared to correlate with geography
  • Using more formal machine learning approaches (i.e., random forest), can we identify additional SNPs? Or confirm that the same SNPs come up?


  • Started with the same data set used for the PCA
    • Imputed WSAF to remove missing sites
    • Called the major allele
  • Used the randomForest package to conduct supervised machine learning
    • Tested categorization method for North/South (Kibaha removed)
    • Tested categorization method for K1/K2
    • Tested regression method for % of genome in K1/K2
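The two preprocessing steps can be sketched in base R. The WSAF matrix here is a simulated stand-in, and per-site mean imputation plus a 0.5 threshold is one reasonable reading of “imputed WSAF” and “called the major allele”, not necessarily the exact method used:

``` r
# Hypothetical samples-x-sites within-sample allele frequency (WSAF) matrix
# with missing entries.
set.seed(7)
wsaf <- matrix(runif(50 * 100), nrow = 50)
wsaf[sample(length(wsaf), 200)] <- NA

# Impute missing WSAF with the per-site mean so no missing sites remain.
site_means <- colMeans(wsaf, na.rm = TRUE)
for (j in seq_len(ncol(wsaf))) wsaf[is.na(wsaf[, j]), j] <- site_means[j]

# Call the major allele: WSAF >= 0.5 -> alternate (1), otherwise reference (0).
major <- ifelse(wsaf >= 0.5, 1L, 0L)
```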


  • Here are the results of the RF model for categorizing North/South samples:
## 
## Call:
##  randomForest(formula = loc2 ~ ., data = train, importance = TRUE,      ntree = 10000, mtry = 200, proximity = TRUE, type = classification) 
##                Type of random forest: classification
##                      Number of trees: 10000
## No. of variables tried at each split: 200
## 
##         OOB estimate of  error rate: 20.45%
## Confusion matrix:
##       North South class.error
## North   113    54   0.3233533
## South    18   167   0.0972973
##        predictions.test.data
##         North South
##   North    42    14
##   South     5    57

## [[1]]
## [1] 0.921947

  • Looks like the model struggles to correctly classify samples into the northern category
  • We can slightly improve this by adjusting the RF parameters, but it’s nothing to write home about

  • Here are the SNPs associated with the north/south divide, as well as some QC:

  • The top hits on chr 2 are in the acetyl-CoA gene that was also identified from the loading components of our PCA (as well as Bob’s analysis of the DRC data)
  • You can see from the x-axis that, like the PCA loading values, these SNPs individually contribute very small amounts to the overall model (i.e., how much would the model’s accuracy decrease if you removed that SNP?)


  • Does the model do better when attempting to categorize by K1 or K2?
## 
## Call:
##  randomForest(formula = pop ~ ., data = train, importance = TRUE,      ntree = 10000, mtry = 200, type = classification) 
##                Type of random forest: classification
##                      Number of trees: 10000
## No. of variables tried at each split: 200
## 
##         OOB estimate of  error rate: 8.5%
## Confusion matrix:
##     0   1 class.error
## 0 143  29  0.16860465
## 1  18 363  0.04724409
##    prediction.test.data
##       0   1
##   0  44  13
##   1   5 122

## [[1]]
## [1] 0.8976378

  • Better. Below are the SNPs, plus a check that we ran enough trees

  • Similar overlap in SNPs


  • Making final figure for the above models


  • NOTE: not sure seeds are working, results from re-running notebook may look slightly different from Moser et al.
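On the seed issue: with stochastic steps like train/test splits and `randomForest()` fits, a seed has to be set immediately before each step, not just once at the top of the notebook. A minimal base-R illustration of the pattern:

``` r
# Same seed immediately before each call -> identical random draws; this is
# the pattern each train/test split (and each model fit) needs.
set.seed(42)
split_a <- sample(734, 500)

set.seed(42)
split_b <- sample(734, 500)

identical(split_a, split_b)   # TRUE: reproducible split
```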
  • TO DO: fix seed issue


Some additional checks

  • How about treating K as a continuous variable (i.e., using regression instead of classification):

## 
## Call:
##  randomForest(formula = K1 ~ ., data = train, importance = TRUE,      ntree = 1000, mtry = 200) 
##                Type of random forest: regression
##                      Number of trees: 1000
## No. of variables tried at each split: 200
## 
##           Mean of squared residuals: 0.02553361
##                     % Var explained: 79.35


  • Below is a smorgasbord of QC for the model above (model 1).
    • Bottom line: nothing I wouldn’t expect, given we have a lot of variables each contributing a small amount.
##    tree      variable minimal_depth
## 1     1   chr1_138823             6
## 2     1   chr1_155939             8
## 3     1   chr1_349253            28
## 4     1   chr1_480434            15
## 5     1   chr1_536003             5
## 6     1   chr1_536315            13
## 7     1   chr1_537427            14
## 8     1 chr10_1063886            11
## 9     1  chr10_108733             3
## 10    1 chr10_1128179            14


  • Keeping the 300 samples with the least missing data, just as a sanity check:
## 
## Call:
##  randomForest(formula = loc2 ~ ., data = train, importance = TRUE,      ntree = 1000) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 40
## 
##         OOB estimate of  error rate: 15.11%
## Confusion matrix:
##       North South class.error
## North    76    27  0.26213592
## South     7   115  0.05737705
##        prediction.test.data
##         North South
##   North    20    14
##   South     4    37
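The subsetting step for that last sanity check could be sketched as follows (simulated genotype matrix standing in for the real data):

``` r
# Simulated genotype matrix with missing calls.
set.seed(8)
geno <- matrix(sample(c(0, 1, NA), 734 * 500, replace = TRUE,
                      prob = c(0.45, 0.45, 0.10)), nrow = 734)

# Keep the 300 samples with the lowest per-sample missingness.
miss <- rowMeans(is.na(geno))
keep <- order(miss)[1:300]
geno_sub <- geno[keep, ]
```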